---
title: "Post-stratification vs. Aggregate Regression Adjustment"
author: "Shiro Kuriwaki"
output: 
  rmarkdown::html_vignette:
    fig_width: 8
    fig_height: 4
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r, message=FALSE}
library(tidyverse)
library(scales)
library(brms)
library(synthjoint)
library(ccesMRPrun)
library(ccesMRPprep)
library(ccesMRPviz)
```


Variables like party ID and religion are not available in the population target. This is a major challenge in MRP (see Leemann and Wasserfallen, [2017](https://doi.org/10.1111/ajps.12319)).


# Approach 1: Aggregate Regression

One option that researchers often use is to include a area-level proportion and simply "control" for that aggregate variable in the regression. This is _not_ the same as poststratification, but we believe that it helps to account for this variable in some way.  (An alternative option, which I examine in the next section is to impute this to the population target through a machine learning model trained on individual data. This strategy introduces some modeling error but still allows for post-stratification.)

Let's explore this proposition with a simple test case.  Suppose we want to know the _proportion of a CD's adults that have a high school degree or less_. This quantity is known in the aggregate and in some joint tables through the ACS, so we have a ground truth in this simulation. 


First create the dummy variable of interest.


```{r}
acs_GA <- acs_GA %>%
  mutate(is_HS = as.numeric(educ == "HS or Less"),
         is_BA = as.numeric(educ %in% c("4-Year", "Post-Grad")))

```

`acs_GA` is a sample ACS population table. Education is already in the data, but suppose we do not know it. 

With this binary variable, let's compute the CD-level quantity of interest.

```{r}
val <- acs_GA %>%
  count(cd, is_HS, is_BA, wt = count) %>%
  group_by(cd) %>%
  summarize(prop_HS_truth = sum(is_HS*n)/sum(n))

acs_df <- left_join(acs_GA, val)

```

Now suppose we have survey data we want to use for MRP. 

```{r}
cces_df <- cces_GA %>%
  # same as ACS manipulation
  mutate(is_HS = as.numeric(educ == "HS or Less")) %>%
  group_by(cd) %>%
  # suppose you know the population aggregate
  left_join(distinct(acs_df, cd, prop_HS_truth), by = "cd") %>%
  ungroup()
```

we merge `prop_HS_truth` here like a CD-level election result: we will _suppose_ that we don't have the `educ` variable in the ACS but we do know the CD-level aggregate, so we merge that in. Notice that this duplicates the rows for every CCES respondent in a given CD.

Not surprisingly, the CD-level aggregate is predictive of the outcome (coefficient 0.85, t-stat of 7).



Also notice that for this example, we are basically giving away the answer (`prop_HS_truth` is the quantity of interest, but we are going to throw it in the regression). Usually you have a predictive proxy only, like when you want to estimate the CD-level voteshare and you only have the lagged voteshare. We are giving away the answer to give the "regression" strategy its best case scenario.


Now let's try three model specifications:

```{r}
# formula specs ----
mrp_forms <- c(
  "omitted"   = is_HS ~ (1|cd),
  "mean_only" = is_HS ~ (1|cd) + prop_HS_truth,
  "oracle"    = is_HS ~ (1|cd) + (1|educ)
)
```

`"omitted"` is a version with no relevant variables in the MRP. `"prop_HS_truth"` is where we will control for the CD-level predictor but we don't have the individual level education. `"oracle"` is the model where we would _post-stratify_ on education -- this is what we would do if we have both education in the survey and population joint distribution.


The ccesMRPrun package will do quick MRP in one step with the `mrp_onestep()` function (see documentation). Let's run MRP with these three specs, and bind the results.


```{r}
# run MRP -- only change formulas at each step
mrp_df <- map_dfr(.x = mrp_forms,
                  .f = function(x) {
                    mrp_onestep(x,
                                cces_df,
                                poststrat_tgt = acs_df,
                                weight_var = "weight",
                                add_on = val, 
                                verbose = FALSE,
                                .cores = 2,
                                .backend = "cmdstanr")},
                  .id = "spec")
mrp_df
```


Now let's plot the MRP results. 

```{r}
mrp_plot <- mrp_df %>%
  mutate(spec = fct_inorder(spec))

scatter_45(mrp_plot,
           prop_HS_truth,
           p_mrp_est,
           lbvar = p_mrp_050,
           ubvar = p_mrp_900,
           by_form = ~spec,
           show_error = TRUE,
           by_labels = c(omitted = "~ (1|cd)",
                         mean_only = "~ (1|cd) + proportion HS",
                         oracle = "~ (1|cd) + (1|education)"),
           xlab = "True Proportion of Adults with High School - only Education",
           ylab = "MRP Estimate") +
  labs(caption = "Note: Estimand is proportion of CD that has a high school education or less.
       First model does not account for education.
       Third model poststratifies on education, and is therefore perfectly correct.
       Second model is only given CD-level proportion of High School-only graduates (the truth).")

```

Notice that the second model where we only control for the CD-level aggregate will tighten the estimates around a 45 degree angle, but it does _not_ make the estimates more representative. There is still a bias of the survey undersampling HS-only voters, and controlling for an aggregate does not solve this question.


With a closer look, we see that including the aggregate variable worsens some estimates in this case. High education districts get their estimates pulled away from the truth, while low education districts improve. This again seems to be a function of pooling where the errors become more homogeneous. 

```{r, echo = FALSE}
mrp_plot %>% 
  filter(spec %in% c("omitted", "mean_only")) %>% 
  transmute(spec, 
            cd, 
            prop_HS_truth,
            error = p_mrp_est - prop_HS_truth) %>% 
  pivot_wider(id_cols = c(cd, prop_HS_truth), names_from = spec, values_from = error) %>% 
  ggplot(aes(omitted, mean_only, color = prop_HS_truth)) +
  geom_point() + 
  geom_abline(linetype = "dashed") +
  scale_x_continuous(labels = unit_format(scale = 100, accuracy = 1, suffix = "pp")) +
  scale_y_continuous(labels = unit_format(scale = 100, accuracy = 1, suffix = "pp")) +
  coord_equal() +
  theme_bw() +
  scale_color_binned(type = "viridis", labels = percent_format(accuracy = 1)) +
  annotate("text", x = -Inf, y = Inf, hjust = -0.1, vjust = 1.1,  label = "Aggregate-only model\nImproves estimates", size = 2.5) +
  annotate("text", x = Inf, y = -Inf, hjust = 1.1, vjust = -0.1,  label = "Aggregate-only model\nWorsens estimates", size = 2.5) +
  labs(x = "Error from Omitted Variable Model",
       y = "Error from Aggregate-Only Model",
       color = "Actual Percent\nHigh School Only") +
  theme(axis.text = element_text(color = "black"))
  
```



# Approach 2: Synthetic Joint Estimation


Instead of including the aggregate information, others might opt to model a synthetic joint distribution. The ccesMRPprep function provides several methods to compute such a synthetic table.

```{r}
acs_syn_mlogit <- synth_mlogit(educ ~ female + age,
                               microdata = cces_GA,
                               poptable = acs_df,  # internally will be treated as if we don't know education
                               area_var = "cd")
```

On average, this produces a model where the _margins_ are smoothed out nationally. In the CCES survey sample, `r percent(mean(cces_GA$educ == "HS or Less"))` have a "HS or Less" degree, so it seems like the numbers match that.  This is not surprising because there are CD "fixed effects" not captured by gender and age, but it is nonethless troubling for MRP.


The ccesMRPprep package provide two other approaches to estimating the joint -- these incorporate another source of information, which is the margins that are available. 

```{r}
edu_margins <- collapse_table(acs_GA, area_var = "cd", X_vars = "educ", 
                              count_var = "count", new_name = "count")
edu_margins
```

Given this data that is simply the marginal distribution of education in each CD, one option is to simply take the product assuming independence

```{r}
acs_syn_prod <- synth_prod(educ ~ female + age, 
                           poptable = acs_GA,
                           newtable = edu_margins,
                           area_var = "cd")
```

and another is to first comibine the survey modeling then fix _those_ margins to the known population margins. 

```{r}
acs_syn_fix <- synth_smoothfix(educ ~ female + age, 
                               microdata = cces_GA,
                               poptable = acs_GA, 
                               fix_to = edu_margins,
                               area_var = "cd")
```

See the ccesMRPprep vignette for validation.

In any case, this allows us to run MRP using this "synthetic" target. Now let's fix the specification and vary the targets

```{r}
syn_datasets <- list(
  "survey"       = acs_syn_mlogit,
  "product"      = acs_syn_prod,
  "surveyfix"    = acs_syn_fix
)
```


```{r}
mrsp_df <- map_dfr(.x = syn_datasets,
                   .f = function(x) {
                     mrp_onestep(is_HS ~ (1|educ) + (1|cd),
                                 cces_df,
                                 poststrat_tgt = mutate(x, count = as.integer(count)),
                                 weight_var = "weight",
                                 add_on = val, 
                                 verbose = FALSE,
                                 .cores = 2,
                                 .backend = "cmdstanr")},
                   .id = "synth")

```

And examine the outcome.

```{r, echo  = FALSE}
mrsp_df %>% 
  mutate(synth = recode_factor(synth, 
                               survey = "Survey Model\n(synth_mlogit)", 
                               product = "Simple Product\n(synth_prod)",
                               surveyfix = "Survey + Rake\n(synth_smoothfix)")) %>% 
  scatter_45(prop_HS_truth,
             p_mrp_est,
             by_form = ~synth,
             lbvar = p_mrp_050,
             ubvar = p_mrp_900,
             show_error = TRUE, 
             xlab = "True Proportion of Adults with High School - only Education",
             ylab = "MRP Estimate")
```


The first estimate look bad.  If the imputation is faulty, it is not a good idea to use that to post-stratify, either. The only reason the second and third estimates are perfect is because we asked MRP _for an outcome_ for which we supplied the joint for.  If we were estimating a different outcome for which the margins are not known (which in practice is always the case), the values would change, of course.


